Cleaning noisy wordnets
Authors
Abstract
Automatic approaches to creating and extending wordnets, which have become very popular in the past decade, inadvertently result in noisy synsets. We therefore propose an approach for detecting synset outliers in order to eliminate this noise and improve the accuracy of the developed wordnets, so that they become more useful lexico-semantic resources for natural language processing applications. The approach compares the words that appear in a synset and its surroundings with the contexts in which the literals in question are used, as observed in large monolingual corpora. By fine-tuning the outlier threshold we can control how many outlier candidates are eliminated. Although the proposed approach is language-independent, we test it on the Slovene and French wordnets, which were created automatically from bilingual resources and contain many disambiguation errors. Manual evaluation of the results shows that, by applying a threshold similar to the estimated error rate in the respective wordnets, 67% of the proposed outlier candidates are indeed incorrect for French and 64% for Slovene. This is a substantial improvement over the estimated overall error rates in the resources, which are 12% for French and 15% for Slovene.
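As a rough illustration of the idea (the function names, similarity measure, and threshold value below are illustrative assumptions, not taken from the paper), an outlier detector of this kind can compare each literal's corpus-derived context vector against the words of its synset and flag low-similarity literals:

```python
# Illustrative sketch of synset outlier detection (names are hypothetical,
# not from the paper): each literal's corpus-derived context vector is
# compared against the bag of words from its synset and surroundings;
# literals whose similarity falls below a tunable threshold are flagged.
import math
from collections import Counter

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in set(u) & set(v))
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def outlier_candidates(synset_words, context_vectors, threshold=0.15):
    """Flag literals whose corpus contexts diverge from the synset.

    synset_words    -- words from the synset and its surroundings
                       (synonyms, gloss, hypernyms, ...)
    context_vectors -- {literal: Counter of co-occurring words},
                       estimated from a large monolingual corpus
    threshold       -- similarity cut-off; raising it flags more
                       candidates, trading precision for recall
    """
    synset_vec = Counter(synset_words)
    return [lit for lit, ctx in context_vectors.items()
            if cosine(ctx, synset_vec) < threshold]

# Toy usage: 'bank' (river sense) looks out of place in a finance synset.
contexts = {
    "deposit": Counter({"money": 8, "account": 5, "interest": 3}),
    "bank":    Counter({"river": 7, "shore": 4, "water": 6}),
}
synset = ["money", "account", "finance", "interest", "deposit"]
print(outlier_candidates(synset, contexts, threshold=0.15))  # ['bank']
```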
Similar resources
Developing and Maintaining a WordNet: Procedures and Tools
In this paper we present a set of tools that help developers of wordnets not only to increase the number of synsets but also to ensure their quality, thus preventing the resource from becoming obsolete too soon. We discuss where the dangers lie in WordNet production and how they were faced in the case of the Serbian WordNet. The developed tools fall into two categories: first are tools for upgrade, cleaning...
Data Cleaning: Approaches for Earth Observation Image Information Mining
The growing volume of data provided by different sources may present inconsistencies: the data can be incomplete, with missing values or only aggregate data, or noisy, containing errors or outliers, etc. Data cleaning then consists in filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. In a more general definition, data cleanin...
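The operations listed in this definition can be illustrated with a short, generic sketch (the data, column name, and thresholds below are invented for the example, not taken from the paper):

```python
# Generic illustration of the cleaning steps listed above (pandas-based;
# the column name and thresholds are invented for the example).
import pandas as pd

df = pd.DataFrame({"reflectance": [0.21, None, 0.25, 9.99, 0.23, 0.22]})

# 1. Fill missing values, e.g. with the column median.
df["reflectance"] = df["reflectance"].fillna(df["reflectance"].median())

# 2. Identify and remove outliers with a 1.5 * IQR fence.
q1, q3 = df["reflectance"].quantile([0.25, 0.75])
fence = 1.5 * (q3 - q1)
df = df[df["reflectance"].between(q1 - fence, q3 + fence)]

# 3. Smooth remaining noise with a small centered rolling median.
df["smoothed"] = df["reflectance"].rolling(3, min_periods=1, center=True).median()
print(df)
```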
Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora
We introduce Zipporah, a fast and scalable data cleaning system. We propose a novel type of bag-of-words translation feature, and train logistic regression models to classify good data and synthetic noisy data in the proposed feature space. The trained model is used to score parallel sentences in the data pool for selection. As shown in experiments, Zipporah selects a high-quality parallel corp...
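A schematic version of this train-then-score pipeline might look as follows; the features here are simplified length-ratio and overlap stand-ins, not Zipporah's actual bag-of-words translation features:

```python
# Schematic sketch of a Zipporah-style selection step (simplified features,
# not the paper's): train a logistic regression to separate clean sentence
# pairs from synthetically noised ones, then score the data pool.
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(src: str, tgt: str) -> list[float]:
    """Toy adequacy/fluency-style features for a sentence pair."""
    s, t = src.split(), tgt.split()
    len_ratio = len(s) / max(len(t), 1)
    overlap = len(set(s) & set(t)) / max(len(set(s) | set(t)), 1)
    return [len_ratio, overlap]

# Clean pairs vs. synthetic noise (e.g. shuffled or mismatched targets).
clean = [("the cat sleeps", "le chat dort"), ("good morning", "bonjour")]
noisy = [("the cat sleeps", "bonjour madame la presidente"),
         ("good morning", "dort le sleeps cat")]
X = np.array([features(s, t) for s, t in clean + noisy])
y = np.array([1] * len(clean) + [0] * len(noisy))

model = LogisticRegression().fit(X, y)

# Score the pool; the highest-scoring pairs would be kept for training.
pool = [("the dog runs", "le chien court")]
scores = model.predict_proba([features(s, t) for s, t in pool])[:, 1]
print(sorted(zip(scores, pool), reverse=True))
```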
How Much Noise in Text is too Much: A Study in Automatic Document Classification
Noise is a stark reality in real-life data. Especially in the domain of text analytics it has a significant impact, as data cleaning forms a very large part (up to 80% of the time) of the data processing cycle. Noisy unstructured text is common in informal settings such as on-line chat, SMS, email, newsgroups and blogs, automatically transcribed text from speech data, and automatically recognized text f...
Tunable Distortion Limits and Corpus Cleaning for SMT
We describe the Uppsala University system for WMT13, for English-to-German translation. We use the Docent decoder, a local search decoder that translates at the document level. We add tunable distortion limits, that is, soft constraints on the maximum distortion allowed, to Docent. We also investigate cleaning of the noisy Common Crawl corpus. We show that we can use alignment-based filtering f...
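The snippet is truncated, but alignment-based filtering of this kind is often implemented by keeping only sentence pairs whose word alignments cover enough of both sides; a hypothetical sketch, assuming alignments are precomputed (e.g. by a word aligner):

```python
# Hedged sketch of alignment-based corpus filtering (the paper's exact
# criteria are truncated above; this shows one common variant): keep only
# pairs in which enough tokens on both sides are covered by the alignment.
def alignment_coverage(n_src, n_tgt, alignment):
    """Fraction of source and target tokens covered by the alignment,
    where `alignment` is a set of (src_idx, tgt_idx) links."""
    src_covered = len({i for i, _ in alignment})
    tgt_covered = len({j for _, j in alignment})
    return src_covered / max(n_src, 1), tgt_covered / max(n_tgt, 1)

def keep_pair(src_tokens, tgt_tokens, alignment, min_cov=0.5):
    src_cov, tgt_cov = alignment_coverage(len(src_tokens), len(tgt_tokens), alignment)
    return src_cov >= min_cov and tgt_cov >= min_cov

# A well-aligned pair is kept; a sparsely aligned one is dropped.
print(keep_pair(["a", "b", "c"], ["x", "y", "z"], {(0, 0), (1, 1), (2, 2)}))  # True
print(keep_pair(["a", "b", "c"], ["x", "y"], {(0, 0)}))                       # False
```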